PDFs padded with null bytes beyond %%EOF (common when uploaded through systems that pad to block boundaries) caused startxref lookup to fail because the 1024-byte search window fell entirely within padding. Skip trailing whitespace to find the effective end of file before searching. Fixes #54.
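A minimal sketch of the idea, not the code from this commit: trim trailing padding to find the effective end of file, then run the backward `startxref` search inside the usual window. The helper names (`isPadding`, `effectiveEnd`, `findStartXRefOffset`) and the default window size parameter are assumptions made for illustration.

```typescript
// Hypothetical helpers for illustration; not the library's actual API.
const STARTXREF = Uint8Array.from("startxref", (c) => c.charCodeAt(0));

/** True for the six PDF white-space bytes: NUL, HT, LF, FF, CR, SP. */
function isPadding(byte: number): boolean {
  return (
    byte === 0x00 || byte === 0x09 || byte === 0x0a ||
    byte === 0x0c || byte === 0x0d || byte === 0x20
  );
}

/** Index one past the last non-padding byte (0 if the buffer is all padding). */
function effectiveEnd(bytes: Uint8Array): number {
  let end = bytes.length;
  while (end > 0 && isPadding(bytes[end - 1])) end--;
  return end;
}

/**
 * Searches backward for "startxref" within `windowSize` bytes of the
 * effective end of file. Returns the keyword offset, or -1 if not found.
 */
function findStartXRefOffset(bytes: Uint8Array, windowSize = 1024): number {
  const end = effectiveEnd(bytes);
  const lowest = Math.max(0, end - windowSize);
  for (let i = end - STARTXREF.length; i >= lowest; i--) {
    let j = 0;
    while (j < STARTXREF.length && bytes[i + j] === STARTXREF[j]) j++;
    if (j === STARTXREF.length) return i;
  }
  return -1;
}
```

Anchoring the window at the effective end rather than the physical end means any amount of trailing padding costs only a linear scan and cannot push the keyword out of range.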
During brute-force recovery, IndirectObjectParser has no lengthResolver, so streams with indirect /Length references (e.g. /Length 42 0 R) would fail to parse. This prevented object streams from being read, making their compressed objects invisible to recovery. Now scans forward for the endstream keyword as a fallback, matching the approach used by pdf.js and PDFBox. Partial fix for #54.
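The fallback can be sketched roughly as follows, assuming the parser already knows where the stream data begins. `indexOfBytes` and `readStreamByScanning` are hypothetical names, and the real parser may handle end-of-line trimming differently.

```typescript
// Hypothetical helpers for illustration; not the library's actual API.
const ENDSTREAM = Uint8Array.from("endstream", (c) => c.charCodeAt(0));

/** Returns the first index >= `from` where `needle` occurs, or -1. */
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array, from: number): number {
  for (let i = Math.max(0, from); i + needle.length <= haystack.length; i++) {
    let j = 0;
    while (j < needle.length && haystack[i + j] === needle[j]) j++;
    if (j === needle.length) return i;
  }
  return -1;
}

/**
 * Used when /Length cannot be resolved: takes everything from `dataStart`
 * up to the next "endstream" keyword as the stream data.
 * Returns null if the keyword is never found.
 */
function readStreamByScanning(bytes: Uint8Array, dataStart: number): Uint8Array | null {
  const keywordAt = indexOfBytes(bytes, ENDSTREAM, dataStart);
  if (keywordAt === -1) return null;
  let end = keywordAt;
  // Drop the single end-of-line marker (LF, CR, or CRLF) before the keyword.
  if (end > dataStart && bytes[end - 1] === 0x0a) end--;
  if (end > dataStart && bytes[end - 1] === 0x0d) end--;
  return bytes.subarray(dataStart, end);
}
```

This is a heuristic: stream data that happens to contain the bytes `endstream` would be cut short, which is why it makes sense only as a fallback when the declared length is unavailable.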
The LRU cache (max 10k) could evict PdfName instances still held as keys in PdfDict's Map<PdfName, PdfObject>, causing silent lookup failures via reference inequality. This manifests in long-running servers processing many PDFs with diverse name sets. Replace with a WeakRef-based cache (matching PDFBox's COSName approach): names stay interned as long as any live object holds a strong reference, and a FinalizationRegistry cleans up dead entries. Also expands the permanent cache with trailer keys (Root, Size, Info, Prev, ID, Encrypt) and high-frequency names (Subtype, Font, BaseFont, Encoding, XObject, Annots, Names). Closes #54.
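A rough sketch of a WeakRef-based interning cache of the kind described here. `Name` and `NameInterner` are hypothetical stand-ins, not the library's `PdfName` implementation, and the sketch assumes a runtime with `WeakRef` and `FinalizationRegistry` (ES2021).

```typescript
// Hypothetical stand-in for an interned name type.
class Name {
  readonly value: string;
  constructor(value: string) {
    this.value = value;
  }
}

class NameInterner {
  // Names that are never evicted (e.g. trailer keys).
  private readonly permanent = new Map<string, Name>();
  // Weakly held names: an entry lives only while something else references the Name.
  private readonly cache = new Map<string, WeakRef<Name>>();
  // Removes map entries once their Name has been garbage collected.
  private readonly registry = new FinalizationRegistry<string>((key) => {
    const ref = this.cache.get(key);
    // A fresh Name may have been interned under the same key; keep it if so.
    if (ref !== undefined && ref.deref() === undefined) this.cache.delete(key);
  });

  constructor(permanentNames: string[] = []) {
    for (const n of permanentNames) this.permanent.set(n, new Name(n));
  }

  /** Returns the canonical Name instance for `value`. */
  of(value: string): Name {
    const pinned = this.permanent.get(value);
    if (pinned !== undefined) return pinned;
    const existing = this.cache.get(value)?.deref();
    if (existing !== undefined) return existing;
    const created = new Name(value);
    this.cache.set(value, new WeakRef(created));
    this.registry.register(created, value);
    return created;
  }
}
```

Because a dictionary that uses a `Name` as a `Map` key holds a strong reference to it, that instance cannot be collected while the dictionary is alive, so a later `of()` call with the same string returns the identical object and reference-equality lookups keep working.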
Benchmark Results

Comparison
- Load PDF
- Create blank PDF
- Add 10 pages
- Draw 50 rectangles
- Load and save PDF
- Load, modify, and save PDF
- Extract single page from 100-page PDF
- Split 100-page PDF into single-page PDFs
- Split 2000-page PDF into single-page PDFs (0.9MB)
- Copy 10 pages between documents
- Merge 2 x 100-page PDFs
- Fill FINTRAC form fields
- Fill and flatten FINTRAC form

Copying
- Copy pages between documents
- Duplicate pages within same document
- Merge PDFs

Drawing: benchmarks/drawing.bench.ts
Forms: benchmarks/forms.bench.ts
Loading: benchmarks/loading.bench.ts
Saving: benchmarks/saving.bench.ts

Splitting
- Extract single page
- Split into single-page PDFs
- Batch page extraction

Environment
Results are machine-dependent.
Fixes #54.
`findStartXRef` misses `startxref` behind trailing null padding

Some systems pad PDFs with null bytes after `%%EOF`. The 1024-byte backward search lands entirely in padding. Now we skip trailing whitespace first, then search.

Brute-force recovery fails on streams with indirect `/Length`

`IndirectObjectParser` during recovery has no `lengthResolver`, so `/Length 42 0 R` throws. If that stream is an ObjStm, its compressed objects are lost. Now catches the failure and scans for `endstream`.

PdfName LRU evicts names still held as PdfDict keys

`PdfDict` uses `Map<PdfName, PdfObject>` reference equality. The 10k LRU could evict names still in use as keys, so `dict.get("Root")` silently returns `undefined`. Replaced with WeakRef + FinalizationRegistry. Names stay cached as long as anyone holds a reference. Load test confirms the old code breaks under pressure.